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Abstract: In the spirit of modeling inference for microarrays as multiple test- 
ing for sparse mixtures, we present a similar approach to a simplified version 
of quantitative trait loci (QTL) mapping. Unlike in case of microarrays, where 
the number of tests usually reaches tens of thousands, the number of tests per- 
formed in scans for QTL usually does not exceed several hundreds. However, 
in typical cases, the sparsity p of significant alternatives for QTL mapping is in 
the same range as for microarrays. For methodological interest, as well as some 
related applications, we also consider non-sparse mixtures. Using simulations 
as well as theoretical observations we study false discovery rate (FDR), power 
and misclassification probability for the Benjamini-Hochberg (BH) procedure 
and its modifications, as well as for various parametric and nonparametric 
Baycs and Parametric Empirical Bayes procedures. Our results confirm the 
observation of Genovese and Wasserman (2002) that for small p the misclas- 
sification error of BH is close to optimal in the sense of attaining the Bayes 
oracle. This property is shared by some of the considered Baycs testing rules, 
which in general perform better than BH for large or moderate p's. 



1. Introduction 

Multiple tests have received considerable attention recently because of application 
to microarrays, where one simultaneously tests a few thousands (m) of null hypothe- 
ses with only a small proportion (p) of signals, i.e., possibly significant alternatives. 
Some recent references are Benjamini and Hochberg [1], Efron et al. [8], Efron and 
Tibshirani [7], Storey et al. [32], Genovese and Wasserman [13], Miiller et al. [19], 
Sarkar [24] or Scott and Berger [25]. If one increases m further, say m = 10 6 , one 
would move from microarrays to problems of homeland security, see for example 
Donoho and Jin [6]. 

We wish to consider a still different scale, namely m in the range of a few hun- 
dreds, which is relevant for quantitative trait loci (QTL) mapping. In this setup we 
explore and compare different multiple testing rules, ranging from the Benjamini 
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and Hochberg [1] procedure [BH], Parametric Empirical Bayes [PEB] procedures, 
to the fully Bayes rule of Scott and Berger [25]. Also, included is a Bayesian non- 
parametric analysis based on Dirichlet mixtures as well as a novel application of 
a nonparametric algorithm for mixture estimation, due to Newton [21]. Our study 
is based on simulations as well as some theoretical observations. The magnitude of 
signals used in our simulation study is chosen according to the suggestions included 
in Donoho and Jin [6] , so as to fulfill the condition of detectability in very sparse 
mixtures. For each of the considered multiple testing procedures we study its power 
(expected value of percentage of correctly identified alternative hypotheses), false 
discovery rate (FDR) and misclassification error and compare them with properties 
of a Bayesian oracle. We pay special attention to BH since one motivation for our 
study was to see if we can come up with a better Bayes or PEB rule. 

Our results confirm the observation of Genovese and Wasserman [12] that for 
very small values of p's (for m = 200, p < 0.05) the misclassification error of 
BH is close to optimal in the sense of attaining a Bayesian oracle. This property is 
shared by some of the considered Bayes testing rules, which perform better than BH 
for larger p. Moreover, in Section 3 we demonstrate that controlling positive false 
discovery rate (pFDR) is equivalent to controlling Bayes risk with the loss function 
depending only on a and thus, somewhat unexpectedly, the rules to control FDR 
or pFDR have a strong Bayesian flavor. 

While our results provide some insight on QTL studies, much further work is 
needed to make our results directly applicable to actual QTL mapping. Our model- 
ing is similar to that of microarrays, whereas the QTL designs require more complex 
linear modeling than for microarrays. The related multiple testing problem, which 
arises when there are many predictors (markers) to choose from, was first addressed 
in Bogdan et al. [1], where a suitable modification of BIC, namely mBIC, is pro- 
posed. We believe that our current research throws some light on how mBIC can be 
further improved by implementing a less conservative multiple testing adjustment. 

The outline of the paper is as follows. In Section 2 we introduce our models and 
explain how some of them are related to QTL mapping. In Section 3 we discuss dif- 
ferent notions of error in multiple testing as well as the relationship between FDR 
controlling rules and Bayesian testing. The procedures considered in our study are 
described in Section 4, except for Bonferroni, which is described in Section 3. The 
results of simulations are given in Section 5. Section 6 contains some illustrations of 
the problem of nonidentifiability of parameters in the mixture model and justifica- 
tion for using the informative prior distribution on p. Section 7 contains our main 
conclusions. Some theoretical results on the performance of the parametric Bayes 
procedure and the nonparametric Bayes procedure based on Dirichlet mixtures are 
given in the Appendix. 

2. Models and implications for QTL mapping 

We consider a multiple testing problem, when the number of tests m is in the 
range of a few hundreds. Such values of m are of importance in QTL mapping and 
they have a methodological interest in that the asymptotic results of Genovese and 
Wasserman [13], Donoho and Jin [G] or Meinshausen and Rice [18] do not yet apply. 

We use the parametric model proposed in Scott and Berger [25]. Thus we consider 
m test statistics X\ , . . . , X m and assume that Xi has either the null distribution 
N(0,a 2 ) or the non-null distribution N(/ii,cr 2 ), where p,i ^ represents some 
signal (e.g. a QTL close to the i-th marker). The signal /ii is taken to be random, 
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distributed as N(0, t 2 ). Hence the non null distribution of Xj is N(0, a 2 + r 2 ). We 
also define a random indicator variable 7j, which is equal to 1 if Xi is generated 
by the non-null distribution (i.e. it represents the signal) or in the other case. If 
p = P(7, = 1), then the marginal distribution of X% is the scale mixture of normals, 
namely, 



Moreover, we assume that (X"t,7j), 1 < i < m, are i.i.d. random vectors. We 
will consider both sparse mixtures, with p < 0.2, and non-sparse mixtures, with a 
relatively large p. Usually we assume that p and r are not known, while a can be 
known or unknown, depending on the application. 

For each i we test whether Xj has a null or non null distribution, i.e. 



A major potential application of our model is QTL mapping. Our modeling takes 
into account the possibility that apart from QTL the trait can be influenced by a 
large number of polygenes, i.e. genes with very small effects, distributed over the en- 
tire genome. If our main interest is in identifying markers linked to QTL we consider 
a sparse mixture (2.1), where p is small and iV(0,<7 2 ) represents the distribution 
of the sum of polygenic and random (environmental) effects. In this context a is 
usually unknown. The second component in the mixture, namely N(0,a 2 + t 2 ), 
represents the distribution of the QTL effect, fa, and the sum of polygenic and 
random effects. Following the majority of Bayesian papers related to QTL map- 
ping (see e.g. Yi [33]) we use N(0,t 2 ) to model the distribution of /ij. Thus our 
model assumes that the probabilities of a positive and a negative QTL effect arc 
the same and is suitable in the situation when the analyzed trait is not the subject 
of a strong selection. Note that under this scenario detecting QTL is particularly 
difficult. Another plausible distribution for |/^| is the gamma distribution (see e.g. 
Otto and Jones [22]). A completely robust alternative is to model fXi's with a non- 
parametric distribution P and put a further prior P ~ Dirichlct, which leads to 
Dirichlct location mixture distribution for AVs. We investigate this approach and 
propose an alternative nonparametric inference based on Newton [21]. 

If our main interest is in both QTL and polygenic effects, the null component, 
iV(0,(7 2 ), represents the distribution of random effects, and N(0,t 2 ), represents 
the distribution of effects due to QTL and polygenes. In this setting p need not be 
small and a 2 may be assumed known, since we can precisely estimate it through 
replications. 

Remark 1. The number of strong QTL, which arc significantly different from the 
background of polygenes, is usually small. In this case only relatively large QTL 
effects, \fj,i\ > <Ty/2 logm, may be identified, since extreme values of the "null" 
component of the mixture are approximately equal to a\J2 logm. In order that 
such signals are generated by the non null component r 2 should be comparable or 
larger than 2er 2 logm. 

Remark 2. The assumption of the independence of X^ can be used when markers 
arc distant from each other. When markers are close to each other, the correspond- 
ing test statistics might be strongly correlated. However, the results reported in 
this paper demonstrate some general properties of the multiple testing procedures 
and show the directions in construction of related methods for detection of linked 



(2.1) 



X~ (l-p)iV(0,<7 2 ) + P N{Q,a 2 +t 2 ). 



(2.2) 



H 0i : 7i = vs H Ai : 7, = 1. 



QTL. 
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3. Different notions of error in multiple testing 

Consider the problem of testing of m hypothesis Hqi, . . . , Ho m , specified in (2.2). 
For each individual test two types of error can occur: the null hypothesis can be 
rejected even though it is true (type I error) or be not rejected when it is false (type 
II error). Following the notation of Bcnjamini and Hochberg [1], Table 1 defines 
variables describing counts of possible outcomes of a multiple testing procedure. 

The main focus of classical statistics is on tests minimizing the probability of the 
type II error (or maximizing the power), while controlling the probability of the type 
I error at a given significance level a. The natural extension of the type I error to the 
situation of testing m hypotheses is the family- wise error rate, FWER = P(V > 0). 
Additionally, the notion of power can be naturally extended to the multiple testing 
as E{-^\mi > 0). Here, as well as in the next part of the paper, E is used to denote 
the frequcntist expectation (i.e. conditional on the vector parameters of the model 
(2.1)). The classical approach to the multiple testing problem relies on constructing 
procedures maximizing the power while controlling FWER at a given level (see e.g. 
Holm [16]). 

In the situation when m is large, procedures controlling FWER are usually very 
conservative. Note that in many practical applications one would often accept false 
discoveries as long as they consist only a small proportion of all discoveries. Going 
along these practical expectations Seeger [26] elaborated on the idea of Eklund 
(unpublished seminar papers) and discussed a stepwise multiple testing procedure 
aimed at controlling the proportion of false discoveries among all discoveries. The 
same stepwise multiple testing procedure has been later discovered by Simes [27], 
who proved that it controls FWER in a weak sense (when all hypothesis are true). 
The notion of proportion of false discoveries appeared again in a paper by Soric 
[28]. Following this paper, Benjamini and Hochberg [1] formally defined the false 
discovery rate as FDR = E(^), where ^ = if R = 0. Benjamini and Hochberg 
also proved that the multiple testing procedure of Seeger and Simes controls FDR 
at a desired level when the test statistics are independent. Following Benjamini and 
Hochberg [1] this procedure gained a large popularity and is currently known as 
the Benjamini and Hochberg (BH) procedure. 

Let P(i) < P(2) . . • < P( m ) be the ordered p-values of m tests. Let 

f ia 1 

(3.1) k = max < i : P^ < — > . 

BH rejects all hypotheses for which the corresponding p-values are smaller than P^ . 
In Benjamini and Yekuticli [3] and Sarkar [23] it is proved that BH controls FDR 
also under certain forms of positive dependence between test statistics. Following 
Benjamini and Hochberg [1] many other criteria and procedures which allow for 
controlling a number or proportion of false discoveries, were developed (see e.g. 
Lchmann and Romano [17], Sarkar [24], Storey [31] and references given there) but 
BH still remains one of the most popular methods of multiple testing. 



Table 1 

Counts of possible outcomes of m hypothesis tests 
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Multiple testing problems can be approached also from the point of view of 
decision theory. Depending on the specifics of the problem, different loss functions 
can be assigned to the two types of errors and the procedure minimizing the related 
risk can be constructed. The corresponding procedures in the framework of Bayesian 
decision theory were discussed e.g. in Miiller et al. [19], Miillcr ct al. [20] or Scott 
and Berger [25]. Further references to Bayesian multiple testing procedures as well 
as a novel Bayesian stepwise multiple testing procedure can be found in Chen and 
Sarkar [5]. 

To point at some similarities between controlling FDR and Bayesian approach 
to multiple testing we now briefly discuss the positive false discovery rate, 

P FDR = E(l\R>^ FDR 



J P(i?>0)' 

defined in Storey [30] and Bayesian false discovery rate 

BFDR = P(H a is true|# is rejected), 

defined in Efron and Tibshirani [7]. 

Theorem 1 of Storey [30] states that in case when individual test statistics are 
generated by the two-component mixture model, like in our setting, pFDR = 
BFDR. It is also pointed out that there are situations in which BFDR can not 
be controlled. An obvious example is when p = 0, since then BFDR = pFDR = 1. 
It is however easy to show that in our testing problem (2.2) BFDR can be con- 
trolled at any given level a if p > 0. The corresponding threshold for the absolute 
value of the test statistic Xi is given by the formula 

(3.2) cfdr = inf <x > : - <a 

where $o an d F are cdfs of N(Q, a 2 ) and the mixture distribution (2.1), respectively. 

Remark 3. The difference between FDR and BFDR may be relatively large for 
small p and a small deviation between the null and alternative distribution (i.e. 
small power). However, in typical QTL or microarray experiments, where m is 
large and some rejections typically occur, the difference between BFDR and FDR 
is usually very small. Based on the asymptotic approximation of FDR by BFDR, 
Genovese and Wasscrman [13] call (3.2) an oracle threshold to control FDR. 

Remark 4. Theorem 5.1 of Benjamini and Yekutieli [3] states that if the test 
statistics are continuous and independent then FDR of BH is equal to amo/m. 
Thus FDR of BH is close to a only when mo is close to m and converges to when 
mo — > 0. When mo is known one can easily modify BH to control FDR at the level 
a by replacing k (see 3.1) with 

{%a 
i : P(D < — 
mo 

In Benjamini and Hochberg [2] a graphical method to estimate mo is proposed and 
the formula (3.3) is used to construct an adaptive version of BH. 

Under the mixture model (2.1) the expectation of mo is equal to m(l — p) and 
a corresponding modified version of BH can be obtained by replacing fcl with 

(3.4) k2 = max <^ i : P {l) < 



m(l — p) 
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Table 2 
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It is easy to prove that this version of BH also has FDR equal to a. Moreover, 
in Efron and Tibshirani [7] it is noticed that the modified BH (3.4) is equivalent 
to the BFDR controlling rule (3.2), with the cdf of the mixture distribution esti- 
mated by the empirical distribution function. In many consecutive papers (see e.g. 
Efron et al. [8], Efron and Tibshirani [7], Storey [29] and Genovese and Wasserman 
[13]) different nonparametric methods of the estimation of (1 — p) and F(x) were 
considered, leading to FDR controlling rules which are more liberal than BH. 

Let us now consider the multiple testing problem from the perspective of decision 
theory. Table 2 defines the specific matrix of losses for making the wrong decision. 

Let us denote by t\ and t 2 the probability of type I and type II errors of a single 
test. The Bayes risk related to the above matrix of losses is given by the following 
equation 

(3.5) BR So , Sa = S (l - P )h + S AP t 2 . 

The Bayes rule, i.e., the test which minimizes this risk, rejects the null hypothesis 

if 

fA{Xj) (1- P )6 

(3-6) , , > 7 , 

Jo(Xi) pb A 

where /o and j a are the densities of Xi under H and Ha, or equivalently if 

So 



(3.7) p l = P{H M \X i )> 



So + 5 A 



We call this test a Bayes oracle and compare other tests to this oracle. 
Let us observe that 



BFDR = 



0--p)h 



(l-p)tl+p(l-t 2 ) 

Thus 

(3.8) BFDR < a iff (1 - a)(l - p)h + apt 2 < ap 

and controlling BFDR controls the Bayes risk with a loss So = 1 — a and 5a = at. 
The classical flavor of BFDR is however strongly reflected in assigning much larger 
loss to the type I error than to the type II error. 

The accuracy of the multiple testing procedure can be judged by its misclas- 
sification probability, MP = B{V + T) ■ Note that MP = BR hl , where BR 1A is 
the Bayes risk corresponding to 0-1 loss. In our parametric setting (2.1) the Bayes 
oracle minimizing BR\_i rejects the hypothesis if 

2 2(a 2 +r> 2 / 1 f<J 2 + r 2 \ . 1-p 



(3.9) Xf > v , 2 ' ( - log ( — — ) + log 



In Figure 1 we compare the Bayes oracle (3.9) to BH and the standard test- 
ing procedure based on the Bonferroni correction. The significance level for each 
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individual test in Bonferroni procedure is equal to a/m. For this presentation 
as well as for simulations reported in Section 5 we use m = 200, a = 1 and 
t = \J2 * log(200) ~ 3.26. For BH and Bonferroni procedures a = 0.05. Apart 
from the standard BH we use its modified version, with the cutoff for p-valucs 
k2 specified by (3.4). The reported characteristics for the Bonferroni correction 
and Bayes oracle were obtained theoretically, while the characteristics of BH were 
computed using computer simulations, based on 10000 replicates. 

Figure 1 demonstrates that, as expected, the modified version of BH keeps FDR 
exactly at the level 0.05, while FDR of the original BH decreases linearly with p. 
Comparison of 1(a) and 1(b) shows that the difference between BFDR and FDR 
is substantial when p < 0.03. In particular, neither versions of BH controls BFDR 
in this range of p. This seems due to the fact that for very small p the threshold 
based on the empirical mixing distribution is substantially more liberal than the 
one provided by (3.2). Both versions of BH take an intermediate position between 
the Bonferroni procedure, which is most conservative, and the most liberal Bayes 
oracle. Figure 1 demonstrates that the most powerful Bayes oracle has also the 
largest FDR. However, as expected, type I and type II errors balance in such a way 
that the misclassification probability (MP) of the Bayes oracle is smaller than that 
of any other method. Interestingly, the modified version of BH performs very well in 
terms of MP over the entire range of p. When p is very small also the original BH 
has a very low MP, which for p = 0.015 is very close to the optimal value provided 
by the Bayes oracle. 




Fig 1. Characteristics of multiple testing procedures. 
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4. Bayes, parametric empirical Bayes and modified BH procedures 

PEB procedures: The natural way of applying the Bayes classifier (3.9) in the 
situation when parameters of (2.1) are unknown is to use their consistent estimates 
and plug them into (3.9). In particular, maximum likelihood estimators (MLE) 
could be considered. 
Let 

rn 

L(X l: X m \p, t, a) = Y[(pf A (Xi) + (1 - p)fo(X l )). 
i=i 

We estimate our model parameters in two steps. First we fix p and estimate r(p) 
and cr(p) using the EM algorithm. In the second step we estimate p by maximizing 
L(X±, . . . , X m \p, t (p), <j{p)) using numerical methods. We plug our estimated pa- 
rameters into (3.9) and denote the resulting Parametric Empirical Bayes Classifier 
as PEB1. As reported in Section 5, PEB1 performs very well for moderate values 
of p. However, when p is very small PEB1 has large FDR and MP. This behavior is 
related to the problems with idcntifiability of parameters of mixture distributions 
discussed in Section 6. Since our main interest is in sparse mixtures we consider 
the following modification of PEB1. Firstly we stabilize the performance of MLE 
by supplying the information included in the data with the prior information on p. 
Using a subjective, informative, prior on p is also strongly recommended in Scott 
and Berger [25], where the following prior density is proposed; 

(4.1) f(p)^P(l-pf- 1 . 

In simulations reported in Scott and Berger [25], the parameter (3 is set to be equal 
to 11, so the corresponding prior on p has its median close to 0.07. To adjust to the 
sparsity typical for QTL mapping experiments we slightly shift this prior towards 
and choose (3 so that the prior median is 0.03, which for m = 200 corresponds to 
6 signals on average. In our simulations (3 = 22.76. The results aren't sensitive to 
small changes in (3. 

The estimate of p is obtained by maximizing 

(4.2) logL(Xx, . . . , X m \p, r{p), a(p)) -((3-1) log(l - p) 

and can be interpreted as a mode of the "posterior" density of p. 

The second modification relies on replacing the maximum likelihood estimates of 
r(p) and a(p) with the moment estimates based on the fourth and the second mo- 
ment of the mixture distribution. We observe that using the fourth moment makes 
our procedure sensitive to the change in the tail of the mixture distribution and 
hence yields good results in a very sparse mixture case. The resulting Parametric 
Empirical Bayes Classifier is denoted by PEB2. 

When <7 is known PEB1 and PEB2 are constructed accordingly For PEB2 r(p) 
is estimated using the fourth moment of the mixture distribution. 

Modified BH: We use estimates of p and a computed by PEB methods to 
construct modified versions of BH, with the threshold specified by (3.4). The version 
based on MLE is denoted by BH1 and the sparse mixture version, based on the 
estimates derived by maximizing (4.2), by BH2. 

Full Bayes approach: We use the framework of Scott and Berger [25] and con- 
struct the full Bayes procedure minimizing the posterior Bayes risk corresponding 
to the 0-1 loss (1 for making type I or type II error). We use noninformative priors 
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for r 2 and cr 2 , suggested in Scott and Berger [25], with densities 



(4.3) 



ksb{o 2 ) = —n and tt S b{t 2 \<J 2 ) 



(a 2 + r 2 ) 2 ' 



When a is known only the prior for r is used. The prior on p is the same as the 
one used by PEB methods, namely (4.1), with (3 = 22.76. To compute the posterior 
probability of Hoi (see formula (9) of Scott and Berger [25]) Markov chain Monte 
Carlo (MCMC) is applied, which according to our simulations is more stable than 
the importance sampling suggested in Scott and Berger [25]. The hypothesis Hgi is 
rejected if 



and the resulting multiple testing procedure is denoted by SB. 

Remark 5. Note that minimizing the posterior Bayes risk is conceptually different 
from minimizing the risk BR\ \ (see (3.5)). BR\,i is conditional on the vector of 
parameters of the mixture model (2.1), while the posterior Bayes risk is conditional 
on the data and depends on the prior. However, Theorem 8.2 in the Appendix states 
that if the parameter space (i.e. p G (0, 1), a > and t > 0) and m increased then 
the misclassification probability of SB converges to the optimal value provided by 
the oracle (3.7). This result is a consequence of Theorem 8.1 on posterior consistency 
under the considered mixture model. Obviously, for each fixed m, the difference 
between misclassification probability of SB, and the oracle depends on the accuracy 
of prior assumptions and due to the choice of the prior on p we expect SB to resemble 
the oracle when the data are generated by the sparse mixture. 

Dirichlet mixtures: The procedures presented so far are based on the assump- 
tion that the distribution of the signals (of m's given 7; = 1) is completely known 
up to finitely many parameters. In practice, however, a lot less is known about the 
signals. A realistic model for such a situation is to consider : /J,i\(ji = 1) ~ P m9 for 
some unknown probability measure P SX9 with P stg ({0}) = 0, which doesn't need 
to be restricted to any parametric family. In this case, X^s arise as independent 
observations from the mixture density f(x) = J (f> a (x — /i)<iP(/z), where P is a 
probability measure that puts some positive mass at the point and distributes 
the remaining mass p according to P slg . A Bayesian analysis of this model is pos- 
sible by using a prior distribution on the space of such probability measures P. A 
suitable candidate is a Dirichlet process prior. Below we introduce a new procedure 
based on this model and prior. 

Let ttsb(t 2 , <J 2 ) = (<r 2 + r 2 )~ 2 denote the joint prior distribution on (r 2 ,cr 2 ) 
recommended in Scott and Berger [25]. We assume that 



where Dir(c,Po) denotes the Dirichlet process prior (see Ferguson [11]) with pre- 
cision constant c > and base measure Pq - a probability measure on the real 
line. Our choice of the base measure, namely Po = (1 — Po)$o +PoN(0, r 2 ), ensures 
that a random P ~ Dir(c, Pq) almost surely puts some positive mass 1 — p on 



(4.4) 



P(H 0i \X 1 ,...,X m ) <0.5 



X l \^ i ,P,a 2 ,T 2 1 p ,c 
H l \P,<j 2 ,T 2 ,p ,c 

P\(T 2 ,T 2 ,Po,C 

(p ,c, t 2 ,ct 2 ) 



N{^,a 2 ) 7 
P, 

Dir(c,(l-p )S {0} +p N(0,T 2 )), 

Beta(l, 22.76) x Gamma{l, 1) x tt S b{t 2 , a 2 ), 
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and distributes the remaining positive mass p on the real line according to some 
probability measure P slg , which is singular to <5{o}- Therefore, without any ambi- 
guity, we can import our familiar signal indicators 7$ into this model by defining 
7i = I(JH + 0). 

Note that the priors for po and (r 2 , a 2 ) match the priors chosen for these param- 
eters in the model proposed in Scott and Berger [25] and presented in the previous 
section. The precision constant c is modeled with a Gamma(l, 1) prior, which is 
quite diffuse with a mean equal to 1 - a conventional choice of this parameter. 

Toward implementation of this model, we first integrate out P from the hier- 
archical structure by using the Polya urn representation of a Dirichlet process. 
Although our specification includes an improper prior on a 2 , it turns out that the 
posterior distribution of (/ii, . . . , jU m , a 2 , r 2 ,po) is indeed a proper distribution; see 
Theorem A. 3 in Appendix. This allows us to obtain an MCMC sample of obser- 
vations (fj,i , . . . ,fJ-m), I = l,...,L, from the posterior distribution of /Ltj's given 
the data. We use the algorithm described in Escobar and West [10] with suitable 
adaptations to our model. Our model differs from the one considered in Escobar 
and West [10] in two aspects: 1. wc consider only a location mixture of normals and 
2. our base measure has a point mass at {0}. The adaptations, however, are not 
complicated and we omit further details. 

With the sample of fXi's collected from the posterior we calculate P(7i = 0\Xi, 
...,X n ) m l-i^P = 0)- A s before, we reject H oi if this estimate is smaller 

than 0.5. We denote this multiple testing procedure by DPP. 

Approximate nonparametric Bayes procedure based on Newton's al- 
gorithm, NPBN: A somewhat related procedure can be obtained by combining 
the above nonparametric model with Newton's algorithm (see Newton [21]), which 
produces an easy to compute, recursive estimate of the distribution P. In particu- 
lar, we start with an initial guess of P given by Po = (1 — Po)S{o} +PoN(0, t 2 ) and 
then recursively update this guess as 

Piid/J,) = (1 - Wi)Pi-l{dn) +W1-F—, 75 7TT, 

where Wi <G (0,1) are prefixed weights (we take Wi = (i + l) -1 )- We take the 
final update P m as the estimate of P. Note that the estimate P m , too, puts some 
positive mass 1 — p m at and distributes the rest according to some density f m on 
the real line. Testing is then performed by mimicking the Oracle rule and replacing 
P with the estimate P m : we reject Hqi if (l/p m — l)4> a (xi)/ J § a (xi — p)f m {p)diJL < 
1. We call this procedure Nonparametric Baycsian Procedure based on Newton's 
algorithm (NPBN). 

For every i, if one models Xi ~ J <f> a (x — n)P(dfj) with P ~ Dir(l, p-i) then 
the posterior expectation of P given the singleton sample {xi} equals Pj. In spite 
of this resemblance, NPBN should not be taken as an approximation to DPP. The 
former, however, has its own set of advantages. 

The biggest advantage of using NPBN is that it produces extremely fast com- 
putation while using a nonparametric model. The reason for its speed stems from 
the one pass routine employed by the algorithm. 

The output of the NPBN procedure depends on the order in which data are fed 
to the algorithm. In our simulations wc align the observations in their ascending 
order of magnitude. With this alignment, Pj's arc first trained on small observa- 
tions, which are mostly noise, followed by the large ones coming mainly from the 
signals. However, as the later updates arc rather less influential (small wi), the 
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concentration of P m on would be systematically inflated. Therefore the chosen 
alignment would systematically result in a more conservative procedure than what 
a random alignment would produce. Such a conservative approach is well suited to 
our anticipation of a small to moderate number of signals. 

While greater speed is a selling point for NPBN, it does suffer from some in- 
flexibility in model specification. Unlike DPP, the NPBN setting docs not allow 
a further prior specification on the parameters pq,t 2 ,g 2 . It is hard to generalize 
the recursive algorithm to include unknown parameters. In the present paper we 
consider NPBN only when a is known and specify r 2 = er 2 , which is equal to the 
mean of the prior distribution on t 2 used by SB and DPP. 

We choose po as 

This quantity is equal to the proportion of rejections one would make by assuming 
P = hSo + ^N(0,r 2 ). This choice calibrates po to r 2 in a natural way - once r 2 is 
picked we update our noniformative choice of po = 1/2 by using this chosen value 
of r 2 . Our simulation study indicates that this data dependent choice of po leads 
to an overall higher efficiency compared to any fixed choice of po . 

5. Simulation results 

Table 3 and Figure 2 demonstrate characteristics of SB, PEB1, PEB2, BHI, BH2 
and NPBN. "Efficiency", represented in Figures 2(a) and (c), is defined as 

MP of the oracle 

E = ; . 

MP of a given procedure 

We do not report the results of the original BH since the performance of BH2 is 
systematically better. The parameter values used in the simulations are m = 200, 
(7 = 1 and r = y/2 * log(200) w 3.26. Due to the computational complexity the 
results for SB are based on 3000 replicates. The results of all other procedures are 
based on 10000 replicates. The large scale simulations were not feasible for DPP, 
which is not represented in Table 3 and Figure 2. 

Table 3 demonstrates that for p < 0.05 PEB1 and BHI have large MP and FDR. 
The properties of these rules quickly improve with increasing p and for psiO.2 MP 
of PEB1 is close to optimal and FDR of BHI is close to 0.05. When a is known 
the characteristics of PEB1 and BHI stay at the assumed level for all p > 0.2 
but when a is unknown these rules deteriorate again when p > 0.8. The sparse 
mixture rules: SB, PEB2 and BH2, perform well for very small p. When a is known 
these rules retain good properties for p £ [0, 0.6] but when a is unknown they 
deteriorate already atp~ 0.3. Figure 2(d) demonstrates that at this point all sparse 
mixture rules start to loose power and become overly conservative. The reason for 
this behavior as well as the corresponding loss of power for PEB1 and BHI when 
p > 0.8 is the difficulty with identifying the model parameters, discussed in detail 
in Section 6. 

Figures 2(a) and (c) demonstrate the "efficiencies" of the sparse mixture rules in 
the most interesting range p < 0.2. PEB1 and BHI are not represented since their 
"efficiencies" for p < 0.03 are below 50%. Figures 2(a) and (c) show that when 
p £ [0.01,0.03] BH2 is almost optimal and has the "efficiency" slightly larger than 
the "efficiencies" of other procedures. However, this characteristic of BH2 system- 
atically decreases with an increase of p and at p — 0.2 it is substantially smaller 
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Table 3 

FDR and Misclassification Probability of multiple testing procedures. BO stands for 

the Bayes Oracle (3.9) 



<y known 



o~ unknown 



p 


BO 


SB 


PEB1 


PEB2 


BH1 


BH2 


NPBN 


SB 


PEB1 


PEB2 


BH1 


BH2 










Misclassification probability in 










0.0 





0.01 


82.4 


0.04 


73.2 


0.03 


0.01 


0.02 


31.3 


0.04 


14.3 


0.03 


0.025 


1.76 


1.8 


19.3 


1.77 


16.8 


1.77 


1.82 


1.84 


11.0 


1.81 


5.11 


1.80 


0.05 


3.36 


3.38 


7.01 


3.40 


6.19 


3.42 


3.46 


3.48 


6.36 


3.48 


4.35 


3.51 


0.2 


11.7 


11.8 


11.8 


11.8 


12.1 


12.2 


11.9 


12.4 


12.2 


12.3 


12.3 


13.0 


0.5 


23.5 


24.5 


24.1 


24.5 


25.5 


26.2 


24.0 


35.5 


24.9 


40.0 


26.4 


42.6 


0.8 


20.0 


29.6 


21.1 


29.5 


28.8 


35.1 


22.2 


79.6 


30.2 


79.7 


41.7 


79.9 












False Discovery Rate 


in % 










0.0 





3.1 


93.1 


6.0 


79.4 


5.2 


2.4 


4.4 


49.4 


5.5 


42.4 


5.6 


0.025 


9.4 


7.8 


31.5 


7.2 


20.5 


5.0 


5.5 


8.6 


28.0 


6.3 


18.7 


4.1 


0.05 


11.2 


8.7 


17.5 


8.0 


8.0 


1.9 


6.9 


9.2 


19.9 


7.0 


11.0 


3.9 


0.2 


12.2 


9.2 


12.9 


8.1 


5.0 


4.7 


8.6 


5.5 


13.9 


7.6 


6.5 


2.9 


0.5 


13.9 


8.0 


14.7 


8.0 


5.3 


4.1 


11.0 


1.1 


14.6 


2.0 


6.2 


0.7 


0.8 


20.0 


5.0 


17.2 


5.1 


6.1 


2.4 


14.8 


0.0 


13.1 
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4.4 
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Fig 2. Characteristics of multiple testing procedures. 



than the "efficiencies" of PEB2 and SB. The "efficiencies" of PEB2 and SB stay 
constant at the level close to 99% when a is known and only slightly decrease to 
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94.5% at p = 0.2 when a is unknown. 

When a is known the nonparametric NPBN procedure performs surprisingly 
well over the entire range of p. It is slightly less efficient than SB and PEB2 when 
p < 0.2 but it retains good properties also for p close to 1. 

In order to get a feeling of the performance of DPP we compare it with SB on a 
case by case basis with the help of a few toy data sets. We generated 10 data sets, 
each of size 200, from the model described in Section 1, with a = 1, r = \/1 log 200 
and various values of p. Five of these data sets are represented in Figure 3. On each 
panel, two scatter plots of Pr(7,; = 0|xi, . . . , x m ) versus x% are presented. The open 
circles joined by the solid line correspond to DPP, whereas the filled circles joined 
by the dotted line correspond to SB. The left column in Figure 3 corresponds to 
the known a case - i.e., both DPP and SB are employed with a 2 fixed at 1 and the 
conditional prior 7r(r 2 |cr 2 = 1) is used for r 2 . The right column corresponds to the 
unknown a case. 

It appears that DPP and SB perform quite similarly, although the former is a 
little more conservative than the latter, particulary when the number of signals is 
very small. This is further illustrated by Table 4. The ten columns in the table 
represent the ten data sets, with the number of signals shown on the header row. 
In each cell of the body of the table, the two values give the numbers of correct 
and incorrect discoveries of signals made by the corresponding procedure for that 
particular data set. From Table 4 we also note that DPP and NPBN are quite 
similar except for samples with many signals (last two columns), where the prior 
on p used by DPP was strongly inappropriate. 

5.1. When the prior assumption is wrong 

In this section we demonstrate the results of simulations illustrating the perfor- 
mance of our methods in the situation when the assumption that the distribution 
of fii under the alternative is normal does not hold. For this simulation we consider 
the case with a known and generate u^s using a symmetrized gamma distribution 
instead of normal distribution. 

Let g{x, r, u) denote the density of the gamma distribution with the shape pa- 
rameter r and the scale parameter u. The symmetrized gamma density describ- 
ing the prior distribution of pi under the alternative is given by the equation 

gA(x) = 0.5g(\x\, r, u). For the current simulation we use r = 4 and u = 2 V 2 J^_ m _^ 
As demonstrated in Figure 4, the deviation from the prior assumption strongly 
affects the behavior of PEB1 and BH1, which are based on the maximum likelihood 
estimates of parameters under the wrong mixture model (2.1). In particular, for 
p £ (0.3, 0.8) these procedures are much too liberal and do not achieve the assumed 
characteristics. The misclassification probability of PEB1 is close to optimal only in 
the range of p £ (0.1, 0.2) and p > 0.9. Also, only in this range FDR of BH1 is close 
or below the assumed value of 0.05. Over the entire range of p the misclassification 
probability of the nonparametric procedure NPBN is decisively smaller than for 
PEB1, which clearly demonstrates the advantage of using nonparametric methods in 
case when the prior distribution is not known. Surprisingly, PEB2 and BH2, which 
are based on the moments estimates under the wrong model and use the strongly 
informative prior on p, behave very well over the entire range of p. We believe that 
their good behavior for moderate and large p is just a coincidence, resulting from 
the opposite influence of different types of errors of our estimation procedure. As 
noted before, under this particular violation of the prior assumption the methods 
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Fig 3. A case by case comparison of DPP and SB for known a: scatter plots of Pr(7i = 
0|xi, . . . , in) vs. Xj. Open circles connected by solid line represent DPP, filled circles connected 
by dotted line represent SB. Left column depicts the situation when a is known, right when a is 
unknown. 



which do not use the prior on p are too liberal when p £ (0.3, 0.8). In case of PEB2 
and BH2 this error seems to cancel the error resulting from using the informative, 
but not adjusted to this range, prior on p. However, other simulations, not reported 
in this paper, suggest that the good behavior of PEB2 and BH2 for p < 0.2 is a 
quite general rule, working under a wide set of different, also asymmetric, prior 
distributions on fj,{. A theoretical explanation of this phenomenon still needs to be 
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Table 4 

A comparison of the numbers of correct and false discoveries of DPP, SB and NPBN 
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Fig 4. Characteristics of multiple testing procedures when the assumption on the prior distribution 
of fii is violated. 



worked out. 



6. Problems with identiflability of model parameters 

An early treatment of the problem caused by lack of identiflability of mixtures 
can be found in Ghosh and Sen [15]. Another recent reference is Elmore et al. [9]. 
In this section we illustrate this problem by a numerical study on the Kullback- 
Leibler distance between different mixture densities and the resulting behavior of 
the maximum likelihood method. 

Consider the problem of a choice between two competing probability models 
Mi and M2, characterized by the density functions /i(a;) and /^(x)- Let K12 = 
/ [log( ^^j )]/i(x)dx denote the Kullback-Leibler distance between these two dis- 
tributions and let V12 = J [log( )] 2 f\(x)dx. Further assume that V12 is finite. 
We consider the case when no prior information is available and our choice of the 
model depends only on the likelihood of the data under Mi and M2. 

Assume now that a sequence of i.i.d. data Xi, . . . ,X m is generated according 
to the model Mi. Let Li = 11';= 1 fi(Xi) denote the corresponding likelihood. Let 
L2 = n"=i .h(Xi) denote the likelihood of the data under the wrong model M2. 
The probability that the likelihood points at the the wrong model M2 is equal to 
P(\ogLi < logi^). Let us denote D12 = logii — log £2- Note that 
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Thus, by the Central Limit Theorem, for sufficiently large m the distribution of 
D\2 can be approximated by the normal distribution with mean equal to mKyi 
and variance mV\2- 

Consider now the case when the models M\ and M 2 belong to the class of mixture 
densities specified in (2.1). The parameters for the model Mi are p = 0.01 , <t\ = 1 
and Ti = -y/21og200 ss 3.26 and the corresponding parameters for the model M 2 are 
P2 = 1, 02 = 1 and t 2 = ^/pTi « 0.326. The parameters for the model M2 are chosen 
in such a way that the probability distributions corresponding to M\ and M2 have 
the same variance. We used the Monte Carlo method to calculate Kyi ~ 0.083 and 
V12 w 0.33. Thus E(L>i 2 ) « 3.93, Var(£>i 2 ) ~ 66.58 and a probability of making a 
wrong decision P(Di2 < 0) ~ 0.31. Note that while the Kullback Leibler distance 
between M\ and M 2 is rather small these models are completely different in the 
percentage of alternative hypothesis and the resulting testing procedures give very 
different results. Wrong decision of accepting the model M 2 leads to a rejection 
of all null hypothesis, while in reality about 99% of them are true. Interestingly, 
the probability of wrongly detecting the corresponding "full" model M 2 quickly 
decreases with an increase of p. This dependence is demonstrated in Figure 5(a). 
The described phenomenon appears when a is known and unknown and forces us to 
use the informative prior distribution on p when testing is performed in the sparse 
mixture setting. 

In case when a is unknown we observe a parallel problem with identifying the 
parameters of the mixture distributions with large values of p. For example con- 
sider the model Mi with p = 0.95 , o\ — 1 and t\ = ^2 log 200 ~ 3.26 and the 
corresponding "null" model M 2 with p 2 = and er 2 = \/ o\ + prf ~ 3.33. For 
this example Kyi ~ 0.0013, Vyi ~ 0.0505 and a probability of making an error 
P(D±2 < 0) « 0.37. Similarly as before, choosing M 2 instead of M% leads to a 
completely wrong testing procedure (i.e. accepting all hypotheses). In this situa- 
tion, probability of making a wrong decision increases with p and is illustrated in 
Figure 5(b). The described phenomenon causes the power of our testing procedures 
to diminish when the fraction of alternatives exceeds a certain threshold value, as 
demonstrated in Figure 2(d). 



(a) wrong choice of the model with p = (b) wrong choice of the model with p = 




Fig 5. Probability of making a wrong model choice as a function of true p. 
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7. Conclusions 

We have examined several multiple testing procedures keeping in mind both FDR 
and decision theoretic criteria like MP (Misclassification Probability), efficiency 
(MP of oracle/MP) and power. We also studied the robustness to some deviations 
from the assumed prior distribution and compared our fully parametric methods 
with methods based on nonparametric priors/mixing distributions as in Dirichlct 
process mixtures or Newton's algorithm. 

We observed that if a is known then most methods tend to perform poorly at 
one of the two extremes. The MLE-based methods (PEB1, BH1) suffer near p = 
due to near lack of identifiability. On the other hand procedures that make use of 
a conservative prior on p (PEB2, BH2, SB) tend to be too conservative near p = 1. 
Surprisingly, the NPBN procedure, based on Newton's algorithm, does well over 
the entire range. 

Our results confirm the observation of Genovese and Wasserman [12] that for 
very small p's (for m = 200, p < 0.03) the misclassification error of BH is close to 
optimal in the sense of attaining a Bayesian oracle. In this range of p BH works 
similarly to the Bayes oracle also in terms of FDR and the power. However for 
p > 0.03 the Bayes oracle becomes much more liberal than BH and allows to obtain 
much smaller misclassification rate. Interestingly, the misclassification probability 
of the modified version of BH, which uses the knowledge on p, is comparable to 
the misclassification probability of the Bayes oracle over the entire range of p. This 
happens even though the False Discovery Rate and the power of these two are quite 
different. Our simulations demonstrate that Empirical Bayes methods can be used 
to estimate p and construct modified versions of BH when the model parameters 
are unknown. 

An interesting methodological fact is that in case when a is unknown all of the 
considered procedures break down for relatively large p. It is somewhat unexpected 
that one would have a problem when p, i.e. the proportion of signals, is large. 
Section 6 explains how this arises due to the nonidentifiability of mixtures. 

The above facts have interesting as well as useful implications for the two appli- 
cations discussed in Section 2. If our main interest is in QTL and polygenic effects, 
then a is due to random effects and can be estimated well by appropriate repli- 
cation. This will virtually reduce the case of unknown a to the known a case and 
improve the quality of inference. On the other hand, if our goal is QTL mapping 
alone, then a represents both random effects and polygenic effects and hence can 
not be directly estimated even with replication. But fortunately for the range of 
p that is relevant for QTL mapping, namely p < 0.2, unknown a does not cause 
problems at least for m > 200 (see Figure 2(c)). 



Appendix 

Theorem A.l. Let X\, . . . , X m be the sequence of i.i.d. rv's with the density spec- 
ified by (2.1). Assume that the unknown vector of parameters 9q = {po,o-q,Tq) is 
in the interior of the parameter space fl = [0,1] x R + x R + . Moreover, assume 
that the prior density is continuous and positive at 9q and that there exists mo G N 
such that the corresponding posterior distribution H(- . . . ,x m ) is proper when 
m > m . Then the posterior distribution is consistent, i.e. with probability 1 for 
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every Euclidean neighborhood U of 6$ it holds 

(A.l) {a(U\X u ...,X m )}^-l asHwoo. 

Theorem 4.4.1 of Ghosh and Ramamoorthi [14] shows that the posterior prob- 
ability of any weak neighborhood of the mixture distribution Pg tends to one as 
m tends to infinity. However, the same result for an Euclidean neighborhood of 
the true parameter 9q requires considerably more work. We omit the proof to save 
space. 

Theorem A. 2. The misclassification probability of the full Bayes procedure SB, 
specified in (4-4): converges to the optimal misclassification probability provided by 
the Bayes oracle (3.9). 

Theorem A. 2 essentially follows from Theorem A.l and regularity properties 
of the mixture density, but the full proof, though along standard lines, is also 
somewhat long and hence omitted. 

Theorem A. 3. The joint posterior distribution of (fj,x, . . . , H n ,P, t 2 , a 2 , c) under 
the DPP specification is proper. 

Proof. We need to show that 



/ = 



m 1 



■. exp 



{Xi - 
2a 2 



Tr(dfii, . . . , dp, m , da 2 , dr 2 , dp) 



is finite. Integrating out P in the hierarchical specification of DPP leads to the 
following joint conditional distribution of the /li's: 



Mi, 
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From this it can be shown that / equals 
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where the sum is taken over all So C {1, . . . , m}, $ G ^({1, • ■ ■ , Tn] \ So) - the 
collection of all partitions of {1, ... , to} \ 5*0, 
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From this one can show that / < oo by a direct verification of the finiteness of 
each of the integrals entering the above finite sum. This exercise can be carried out 
by 1. substituting Z = Jj, 2. integrating out a 1 and 3. by using the fact that t 2 /a 2 
admits a proper density. □ 
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